 metadata extraction


Metadata Extraction Leveraging Large Language Models

Han, Cuize, Jalagam, Sesh

arXiv.org Machine Learning

The advent of Large Language Models has revolutionized tasks across domains, including the automation of legal document analysis, a critical component of modern contract management systems. This paper presents a comprehensive implementation of LLM-enhanced metadata extraction for contract review, focusing on the automatic detection and annotation of salient legal clauses. Leveraging both the publicly available Contract Understanding Atticus Dataset (CUAD) and proprietary contract datasets, our work demonstrates the integration of advanced LLM methodologies with practical applications. We identify three pivotal elements for optimizing metadata extraction: robust text conversion, strategic chunk selection, and advanced LLM-specific techniques, including Chain of Thought (CoT) prompting and structured tool calling. Our experimental results show substantial improvements in clause identification accuracy and efficiency. Our approach shows promise in reducing the time and cost associated with contract review while maintaining high accuracy in legal clause identification. The results suggest that carefully optimized LLM systems could serve as valuable tools for legal professionals, potentially increasing access to efficient contract review services for organizations of all sizes.


Muse-it: A Tool for Analyzing Music Discourse on Reddit

Agarwala, Jatin, Paul, George, Vardhan, Nemani Harsha, Alluri, Vinoo

arXiv.org Artificial Intelligence

Music engagement spans diverse interactions with music, from selection and emotional response to its impact on behavior, identity, and social connections. Social media platforms provide spaces where such engagement can be observed in natural, unprompted conversations. Advances in natural language processing (NLP) and big data analytics make it possible to analyze these discussions at scale, extending music research to broader contexts. Reddit, in particular, offers anonymity that encourages diverse participation and yields rich discourse on music in ecological settings. Yet the scale of this data requires tools to extract, process, and analyze it effectively. We present Muse-it, a platform that retrieves comprehensive Reddit data centered on user-defined queries. It aggregates posts from across subreddits, supports topic modeling, temporal trend analysis, and clustering, and enables efficient study of large-scale discourse. Muse-it also identifies music-related hyperlinks (e.g., Spotify), retrieves track-level metadata such as artist, album, release date, genre, popularity, and lyrics, and links these to the discussions. An interactive interface provides dynamic visualizations of the collected data. Muse-it thus offers an accessible way for music researchers to gather and analyze big data, opening new avenues for understanding music engagement as it naturally unfolds online.


Comparison of Feature Learning Methods for Metadata Extraction from PDF Scholarly Documents

Boukhers, Zeyd, Yang, Cong

arXiv.org Artificial Intelligence

The availability of metadata for scientific documents is pivotal in propelling scientific knowledge forward and for adhering to the FAIR principles (i.e. Findability, Accessibility, Interoperability, and Reusability) of research findings. However, the lack of sufficient metadata in published documents, particularly those from smaller and mid-sized publishers, hinders their accessibility. This issue is widespread in some disciplines, such as the German Social Sciences, where publications often employ diverse templates. To address this challenge, our study evaluates various feature learning and prediction methods, including natural language processing (NLP), computer vision (CV), and multimodal approaches, for extracting metadata from documents with high template variance. We aim to improve the accessibility of scientific documents and facilitate their wider use. To support our comparison of these methods, we provide comprehensive experimental results, analyzing their accuracy and efficiency in extracting metadata. Additionally, we provide valuable insights into the strengths and weaknesses of various feature learning and prediction methods, which can guide future research in this field.


Supercharge Content Intelligence with AI

#artificialintelligence

Artificial intelligence (AI) creates abundant opportunities for a wide range of intelligent, automated business operations. Two vital capabilities, metadata extraction and data enrichment, rank among the most valuable, commonly used functions for businesses seeking to harness immediate value from organizational data and content. AI-driven techniques for rapidly sorting, filtering, categorizing, and adding context to massive volumes of data can help deliver a distinct business advantage. By combining accessible, cloud-based AI services and customizable, specialized AI tools and training, businesses can shape data and content services to better meet their objectives. Despite the accelerating, never-ending spiral of accumulating content, most businesses are neither gaining the insights they need nor seeing visible operational benefits, as asserted in a Software Development Times article.


An Agent based Approach towards Metadata Extraction, Modelling and Information Retrieval over the Web

Ahmed, Zeeshan, Gerhard, Detlef

arXiv.org Artificial Intelligence

Web development is a challenging research area because of its creativity and complexity. A key challenge in the development of web technology is presenting data in a machine-readable and machine-processable format that supports knowledge-based information extraction and maintenance [4]. Currently it is not possible to search and extract optimized results using full-text queries, because no mechanism exists that can fully extract the semantics from a full-text query and then look for the particular knowledge-based information. Presenting information over the web in a format whose context both humans and machines can understand leads to the concept of the Semantic Web, introduced by Tim Berners-Lee [4]. The Semantic Web is a linked mesh of information intended to produce technologies capable of reasoning over semi-structured information that can be processed by machines [4].